1 GPGPU: general purpose computation on graphics hardware
David Luebke, Mark Harris, Jens Kruger, Tim Purcell, Naga Govindaraju, Ian Buck, Cliff Woolley, Aaron Lefohn

August 2004 Proceedings of the conference on SIGGRAPH 2004 course notes 

SIGGRAPH '04 
Publisher: ACM Press 

Full text available: pdf (63.03 MB) Additional Information: full citation, abstract

The graphics processor (GPU) on today's commodity video cards has evolved into an 
extremely powerful and flexible processor. The latest graphics architectures provide 
tremendous memory bandwidth and computational horsepower, with fully programmable 
vertex and pixel processing units that support vector operations up to full IEEE floating 
point precision. High level languages have emerged for graphics hardware, making this 
computational power accessible. Architecturally, GPUs are highly parallel s ... 

2 The implications of cache affinity on processor scheduling for multiprogrammed, shared memory multiprocessors

Raj Vaswani, John Zahorjan 

September 1991 ACM SIGOPS Operating Systems Review, Proceedings of the thirteenth ACM symposium on Operating systems principles SOSP '91, Volume 25 Issue 5
Publisher: ACM Press 



Full text available: pdf (157 MB) Additional Information: full citation, abstract, references, citings, index terms



In a shared memory multiprocessor with caches, executing tasks develop "affinity" to 
processors by filling their caches with data and instructions during execution. A scheduling 
policy that ignores this affinity may waste processing power by causing excessive cache 
refilling. Our work focuses on quantifying the effect of processor reallocation on the 
performance of various parallel applications multiprogrammed on a shared memory 
multiprocessor, and on evaluating how the magnitude of this cost aff ... 

3 Computing curricula 2001

September 2001 Journal on Educational Resources in Computing (JERIC) 

Publisher: ACM Press 

Full text available: pdf (613.63 KB), html (2.78 KB) Additional Information: full citation, references, citings, index terms

4 Cache Memories
Alan Jay Smith 

September 1982 ACM Computing Surveys (CSUR), Volume 14 Issue 3 
Publisher: ACM Press 

Full text available: pdf (4.61 MB) Additional Information: full citation, references, citings, index terms




5 A dynamic processor allocation policy for multiprogrammed shared-memory multiprocessors
Cathy McCann, Raj Vaswani, John Zahorjan

May 1993 ACM Transactions on Computer Systems (TOCS), volume 11 issue 2
Publisher: ACM Press 

Full text available: pdf (2.26 MB) Additional Information: full citation, abstract, references, citings, index terms

We propose and evaluate empirically the performance of a dynamic processor-scheduling 
policy for multiprogrammed shared-memory multiprocessors. The policy is dynamic in that 
it reallocates processors from one parallel job to another based on the currently realized 
parallelism of those jobs. The policy is suitable for implementation in production systems 
in that: —It interacts well with very efficient user-level thread packages, leaving to them 
many low-level thr ... 

Keywords: shared memory parallel processors, threads, two-level scheduling 



6 External memory algorithms and data structures: dealing with massive data
Jeffrey Scott Vitter 

June 2001 ACM Computing Surveys (CSUR), volume 33 issue 2 
Publisher: ACM Press 

Full text available: pdf (828.46 KB) Additional Information: full citation, abstract, references, citings, index terms

Data sets in large applications are often too massive to fit completely inside the computer's
internal memory. The resulting input/output communication (or I/O) between fast internal 
memory and slower external memory (such as disks) can be a major performance 
bottleneck. In this article we survey the state of the art in the design and analysis of 
external memory (or EM) algorithms and data structures, where the goal is to exploit 
locality in order to reduce the I/O costs. We consider a varie ... 

Keywords: B-tree, I/O, batched, block, disk, dynamic, extendible hashing, external 
memory, hierarchical memory, multidimensional access methods, multilevel memory, 
online, out-of-core, secondary storage, sorting 



7 The priority-based coloring approach to register allocation
Fred C. Chow, John L. Hennessy 

October 1990 ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 12 Issue 4
Publisher: ACM Press 

Full text available: pdf (2.97 MB) Additional Information: full citation, abstract, references, citings, index terms, review

Global register allocation plays a major role in determining the efficacy of an optimizing 
compiler. Graph coloring has been used as the central paradigm for register allocation in 
modern compilers. A straightforward coloring approach can suffer from several 
shortcomings. These shortcomings are addressed in this paper by coloring the graph using
a priority ordering. A natural method for dealing with the spilling emerges from this 
approach. The detailed algorithms for a priority-based colori ... 

8 Real-time shading
Marc Olano, Kurt Akeley, John C. Hart, Wolfgang Heidrich, Michael McCool, Jason L. Mitchell, Randi Rost

August 2004 Proceedings of the conference on SIGGRAPH 2004 course notes 
SIGGRAPH '04 

Publisher: ACM Press 

Full text available: pdf (7.39 MB) Additional Information: full citation, abstract

Real-time procedural shading was once seen as a distant dream. When the first version of 
this course was offered four years ago, real-time shading was possible, but only with one- 
of-a-kind hardware or by combining the effects of tens to hundreds of rendering passes. 
Today, almost every new computer comes with graphics hardware capable of interactively 
executing shaders of thousands to tens of thousands of instructions. This course has been 
redesigned to address today's real-time shading capabili ... 

9 Level set and PDE methods for computer graphics
David Breen, Ron Fedkiw, Ken Museth, Stanley Osher, Guillermo Sapiro, Ross Whitaker
August 2004 Proceedings of the conference on SIGGRAPH 2004 course notes 

SIGGRAPH '04 
Publisher: ACM Press 

Full text available: pdf (17.07 MB) Additional Information: full citation, abstract, citings

Level set methods, an important class of partial differential equation (PDE) methods, 
define dynamic surfaces implicitly as the level set (iso-surface) of a sampled, evolving nD 
function. The course begins with preparatory material that introduces the concept of using 
partial differential equations to solve problems in computer graphics, geometric modeling 
and computer vision. This will include the structure and behavior of several different types 
of differential equations, e.g. the level set eq ... 

10 A parallel, incremental, mostly concurrent garbage collector for servers
Katherine Barabash, Ori Ben-Yitzhak, Irit Goft, Elliot K. Kolodner, Victor Leikehman, Yoav Ossia, Avi Owshanko, Erez Petrank

November 2005 ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 27 Issue 6
Publisher: ACM Press 

Full text available: pdf (683.50 KB) Additional Information: full citation, abstract, references, index terms

Multithreaded applications with multigigabyte heaps running on modern servers provide 
new challenges for garbage collection (GC). The challenges for "server-oriented" GC 
include: ensuring short pause times on a multigigabyte heap while minimizing throughput 
penalty, good scaling on multiprocessor hardware, and keeping the number of expensive 
multicycle fence instructions required by weak ordering to a minimum. We designed and 
implemented a collector facing these demands building on th ... 

Keywords: Garbage collection, JVM, concurrent garbage collection 



11 Scheduler activations: effective kernel support for the user-level management of parallelism

Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, Henry M. Levy 
February 1992 ACM Transactions on Computer Systems (TOCS), volume 10 issue 1
Publisher: ACM Press

Full text available: pdf (2.04 MB) Additional Information: full citation, abstract, references, citings, index terms, review

Threads are the vehicle for concurrency in many approaches to parallel programming. 
Threads can be supported either by the operating system kernel or by user-level library 
code in the application address space, but neither approach has been fully satisfactory. 
This paper addresses this dilemma. First, we argue that the performance of kernel threads 
is inherently worse than that of user-level threads, rather than this being an artifact of 
existing ... 

Keywords: multiprocessor, thread 



12 A simple interprocedural register allocation algorithm and its effectiveness for LISP 
Peter A. Steenkiste, John L. Hennessy

January 1989 ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 11 Issue 1
Publisher: ACM Press 

Full text available: pdf (2.56 MB) Additional Information: full citation, abstract, references, citings, index terms, review

Register allocation is an important optimization in many compilers, but with per-procedure
register allocation, it is often not possible to make good use of a large register set. 
Procedure calls limit the improvement from global register allocation, since they force 
variables allocated to registers to be saved and restored. This limitation is more 
pronounced in LISP programs due to the higher frequency of procedure calls. An 
interprocedural register allocation algorithm is developed by simp ... 

13 Streamlining data cache access with fast address calculation
Todd M. Austin, Dionisios N. Pnevmatikatos, Gurindar S. Sohi

May 1995 ACM SIGARCH Computer Architecture News, Proceedings of the 22nd annual international symposium on Computer architecture ISCA '95, volume 23 Issue 2

Publisher: ACM Press 

Full text available: pdf (1.58 MB) Additional Information: full citation, abstract, references, citings, index terms

For many programs, especially integer codes, untolerated load instruction latencies 
account for a significant portion of total execution time. In this paper, we present the 
design and evaluation of a fast address generation mechanism capable of eliminating the 
delays caused by effective address calculation for many loads and stores. Our approach 
works by predicting early in the pipeline (part of) the effective address of a memory 
access and using this predicted address to speculatively access the ... 

14 Query evaluation techniques for large databases 
Goetz Graefe 

June 1993 ACM Computing Surveys (CSUR), volume 25 issue 2 
Publisher: ACM Press 

Full text available: pdf (9.37 MB) Additional Information: full citation, abstract, references, citings, index terms, review

Database management systems will continue to manage large data volumes. Thus, 
efficient algorithms for accessing and manipulating large sets and sequences will be 
required to provide acceptable performance. The advent of object-oriented and extensible 
database systems will not solve this problem. On the contrary, modern data models 
exacerbate the problem: In order to manipulate large sets of complex objects as efficiently 
as today's database systems manipulate simple records, query-processi ...

Keywords: complex query evaluation plans, dynamic query evaluation plans, extensible
database systems, iterators, object-oriented database systems, operator model of 
parallelization, parallel algorithms, relational database systems, set-matching algorithms, 
sort-hash duality 



15 Avoidance and suppression of compensation code in a trace scheduling compiler
Stefan M. Freudenberger, Thomas R. Gross, P. Geoffrey Lowney

July 1994 ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 16 Issue 4
Publisher: ACM Press 

Full text available: pdf (3.58 MB) Additional Information: full citation, abstract, references, citings, index terms, review

Trace scheduling is an optimization technique that selects a sequence of basic blocks as a 
trace and schedules the operations from the trace together. If an operation is moved 
across basic block boundaries, one or more compensation copies may be required in the 
off-trace code. This article discusses the generation of compensation code in a trace 
scheduling compiler and presents techniques for limiting the amount of compensation 
code: avoidance (restricting code motion so that no compensatio ... 

Keywords: SPEC89, instruction-level parallelism, performance evaluation, trace 
scheduling 



16 Scheduler activations: effective kernel support for the user-level management of parallelism

Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, Henry M. Levy 
September 1991 ACM SIGOPS Operating Systems Review, Proceedings of the thirteenth ACM symposium on Operating systems principles SOSP '91, Volume 25 Issue 5
Publisher: ACM Press

Full text available: pdf (1.68 MB) Additional Information: full citation, abstract, references, citings, index terms

Threads are the vehicle for concurrency in many approaches to parallel programming. 
Threads separate the notion of a sequential execution stream from the other aspects of 
traditional UNIX-like processes, such as address spaces and I/O descriptors. The objective 
of this separation is to make the expression and control of parallelism sufficiently cheap 
that the programmer or compiler can exploit even fine-grained parallelism with acceptable 
overhead. Threads can be supported either by the op ... 

17 Eliminating synchronization bottlenecks using adaptive replication 
Martin C. Rinard, Pedro C. Diniz 

May 2003 ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 25 Issue 3
Publisher: ACM Press 

Full text available: pdf (826.28 KB) Additional Information: full citation, abstract, references, index terms, review

This article presents a new technique, adaptive replication, for automatically eliminating 
synchronization bottlenecks in multithreaded programs that perform atomic operations on 
objects. Synchronization bottlenecks occur when multiple threads attempt to concurrently 
update the same object. It is often possible to eliminate synchronization bottlenecks by 
replicating objects. Each thread can then update its own local replica without 
synchronization and without interacting with other threads. When ... 

Keywords: Atomic operations, commutativity analysis, parallel computing, parallelizing

18 4.2BSD and 4.3BSD as examples of the UNIX system
John S. Quarterman, Abraham Silberschatz, James L. Peterson

December 1985 ACM Computing Surveys (CSUR), volume 17 issue 4
Publisher: ACM Press 

Full text available: pdf (4.07 MB) Additional Information: full citation, abstract, references, citings, index terms, review

This paper presents an in-depth examination of the 4.2 Berkeley Software Distribution, 
Virtual VAX-11 Version (4.2BSD), which is a version of the UNIX Time-Sharing System. 
There are notes throughout on 4.3BSD, the forthcoming system from the University of 
California at Berkeley. We trace the historical development of the UNIX system from its 
conception in 1969 until today, and describe the design principles that have guided this 
development. We then present the internal data structures and ...

19 Empirical performance evaluation of concurrency and coherency control protocols for database sharing systems
Erhard Rahm

June 1993 ACM Transactions on Database Systems (TODS), volume 18 issue 2 
Publisher: ACM Press 

Full text available: pdf (3.37 MB) Additional Information: full citation, abstract, references, citings, index terms, review

Database Sharing (DB-sharing) refers to a general approach for building a distributed high 
performance transaction system. The nodes of a DB-sharing system are locally coupled via 
a high-speed interconnect and share a common database at the disk level. This is also 
known as a "shared disk" approach. We compare database sharing with the database 
partitioning (shared nothing) approach and discuss the functional DBMS components that 
require new and coordinated solutions for DB-shar ... 

Keywords: coherency control, concurrency control, database partitioning, database 
sharing, performance analysis, shared disk, shared nothing, trace-driven simulation 

20 Idle time scheduling with preemption intervals
Lars Eggert, Joseph D. Touch 

October 2005 ACM SIGOPS Operating Systems Review , Proceedings of the twentieth 
ACM symposium on Operating systems principles SOSP '05, volume 39 issue 

Publisher: ACM Press 

Full text available: pdf (2.02 MB) Additional Information: full citation, abstract, references, index terms

This paper presents the idletime scheduler; a generic, kernel-level mechanism for using 
idle resource capacity in the background without slowing down concurrent foreground use. 
Many operating systems fail to support transparent background use and concurrent 
foreground performance can decrease by 50% or more. The idletime scheduler minimizes 
this interference by partially relaxing the work conservation principle during preemption 
intervals, during which it serves no background requests eve ... 

Keywords: background processing, disk scheduler, idletime scheduling, network 
scheduler, preemption interval, queuing, resource scheduler 